Programming
This forum is for all programming questions. The question does not have to be directly related to Linux and any language is fair game.
So I thought I'd post my findings on another case for discussion. (I hope this is appropriate on this forum.)
Scenario
A folder contains about 280,000 HTML files, all residing in monthly subfolders and following the same naming convention. IDs are numeric and of varying length.
Code:
data/<yyyy>/<mm>/event_<id>.html
I need to build a file containing just the files' IDs, excluding one particular monthly folder.
My gast is flabbered; I had expected tuxdev's solution to be a little quicker, not almost an order of magnitude slower! To narrow down what causes such poor performance, how does this variant perform:
Code:
while IFS= read -r file ; do
    id="${file%.html}"
    id="${id##*_}"
    echo "$id"
done <<< "$(find ... | grep ...)"
Come on guys, it's the while loop that's slow. Except for small operations, using external programs is always going to be faster than doing it in bash.
I was surprised that the -prune wasn't faster, but I think it's because there's only one directory to prune, so it doesn't save much. And since -path takes glob patterns, it's slower than grep -F, which takes fixed strings.
Is it the while loop or the read? If the read is the underlying cause, is that because bash doesn't buffer stdin (which would be a strange design choice)?
So it appears that read actually isn't buffering. I'm not certain; the code deals with a lot of different options, so it's hard to follow. (It would make sense, though: when stdin isn't seekable, read has to fetch one byte at a time so it doesn't consume input intended for later commands.)
However, I don't think it matters either way; here is a while loop without read:
Code:
#!/bin/bash
COUNT=40000 # maybe increase this if your computer is faster than mine.
echo use bash loop
declare -i i=0
time while ((i<COUNT)) ; do ((i++)) ; done
echo ========
echo use seq
time seq $COUNT >/dev/null
For me the while loop takes around 1.3 seconds and the seq takes 0.1 seconds. bash is over 10 times as slow, and that's after I took out the echo statement.
Those are very educational figures, ntubski, and show the importance of measurement instead of blindly accepting received wisdom. For years I have accepted the plausible hypothesis, perhaps once true, that the fork+exec system calls to run an external program are slow relative to in-shell actions.
On my system the output from your script was:
Code:
use bash loop
real 0m0.426s
user 0m0.405s
sys 0m0.010s
========
use seq
real 0m0.071s
user 0m0.041s
sys 0m0.004s
I modified it to also test an equivalent for ((i=0; i<COUNT; i++)) loop and found it ~10% faster than the while loop. Then I added a call to /bin/true in the loop and found the /bin/true calls took ~0.0001 s each, which is only ~100 times longer than each bare loop iteration.
Presumably there is some caching and hashing, which means the first call to /bin/true could have taken significantly longer than 0.0001 s.
Quote: "For years I have accepted the plausible hypothesis, perhaps once true, that the fork+exec system calls to run an external program are slow relative to in-shell actions."
Well they are, but in the example here there are O(n) in-shell actions but O(1) external calls, so for large n the fork+exec overhead is negligible.